One of the big issues when it comes to working with data in any context is the issue of data cleaning and merging of datasets, since it is often the case that you will find yourself having to collate data across multiple files, and will need to rely on R to carry out functions that you would normally carry out using commands like VLOOKUP
in Excel.
For our example, we have two datasets:
mydata
.
customers.csv: This file contains the variables ID, Age, and Country. We load this into R under the name mydata2
.
The following are examples of popular techniques employed in R to clean a dataset, along with how to format variables effectively to facilitate analysis. The below functions work particularly well with panel datasets, where we have a mixture of cross-sectional and time series data:
To start off with a simple example, let us choose the customers dataset. Suppose that we only wish to include the variables ID and Age in our data. To do this, we define our data frame as follows:
dataframe<-data.frame(ID,Age)
Often times, it is necessary to combine two variables from different datasets similar to how VLOOKUP is used in Excel to join two variables based on certain criteria. In R, this can be done using the merge
function.
For instance, suppose that we wish to link the Date variable in the sales dataset with the Age and Country variables in the customers dataset – with the ID variable being the common link.
Therefore, we do as follows:
mergeinfo<-merge(mydata[, c("ID", "Sales")],mydata2[, c("ID", "Age", "Country")])
Upon doing this, we see that a new dataset is formed in R joining our chosen variables:
Suppose that we now wish to calculate the number of days between the current date and the date of sale as listed in the sales file. In order to accomplish this, we can use as.date as follows:
currentdate=as.Date('2016-12-15')
dateinfile=as.Date(Date)
Duration=currentdate-dateinfile
Going back to the example above, suppose that we now wish to combine this duration variable with the rest of our data.
Hence, we can now combine our new Duration variable with the merge
function as above, and can do this as follows:
durationasdouble=as.double.difftime(Duration, units='days')
updateddataframe=data.frame(ID,Sales,Date,durationasdouble)
updateddataframe
Let us look to the Country variable. Suppose that we wish to remove all instances of “Greenland” from our variable. This is accomplished using the grepl
command:
countryremoved<-mydata2[!grepl("Greenland", mydata2$Country),]
The head and tail functions can be used if we wish to delete certain observations from a variable, e.g. Sales. The head function allows us to delete the first 30 rows, while the tail function allows us to delete the last 30 rows.
When it comes to using a variable edited in this way for calculation purposes, e.g. a regression, the as.matrix function is also used to convert the variable into matrix format:
Salesminus30days←head(Sales,-30)
X1=as.matrix(Salesminus30days)
X1
Salesplus30days<-tail(Sales,-30)
X2=as.matrix(Salesplus30days)
X2
Let us suppose that we have created the following table as below, and want to obtain the sum of web visits and average minutes spent on a website in any particular period:
In this instance, we can replicate the SUMIF function in Excel (where the values associated with a specific identifier are summed up) by using the aggregate function in R. This can be done as follows (where raw_table is the table specified as above):
sumif_table<-aggregate(. ~ names, data=raw_table, sum)
sumif_table
Thus, the values associated with identifiers (in this case, names) are summed up as follows:
Hope you enjoyed this tutorial!
-> To get access to the datasets and full R code for this tutorial, please subscribe to my e-book package and receive your free guide titled, “R: Regression Analysis and Data Structuring Methods”.